feat(macOS): add vfkit backend for ephemeral and persistent VMs#259
tnk4on wants to merge 4 commits
Code Review
This pull request introduces macOS support for managing ephemeral and persistent VMs using the vfkit backend and gvproxy for networking. It includes logic for extracting kernels from bootc containers, creating SquashFS root filesystems, and managing VM lifecycles through new CLI subcommands. Feedback highlights security concerns regarding potential command injection during SSH key setup and a TOCTOU race condition in port allocation. Additionally, the use of hardcoded global paths in /private/tmp was flagged as problematic for multi-user environments, and improvements were suggested for handling I/O results when communicating with gvproxy.
| std::path::PathBuf::from("/private/tmp/bcvk/vms") | ||
| } | ||
Using a hardcoded global path in /private/tmp/bcvk for VM metadata and sockets is problematic on multi-user systems. It can lead to permission conflicts and security risks if multiple users attempt to run the tool simultaneously. Since podman machine on macOS typically shares the user's home directory by default, consider using a user-specific path like ~/.cache/bcvk/vms or ensuring the directory in /private/tmp is user-private (e.g., by including the UID in the name and setting 0700 permissions).
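The second option in the comment above (UID in the directory name plus 0700 permissions) could look roughly like the following. This is a minimal std-only sketch, not the PR's code: `user_vm_dir` and `create_private_dir` are hypothetical helper names, and the UID is passed in by the caller so the example does not depend on any particular crate for looking it up.

```rust
use std::fs::{self, DirBuilder, Permissions};
use std::io;
use std::os::unix::fs::{DirBuilderExt, PermissionsExt};
use std::path::{Path, PathBuf};

/// Hypothetical helper: per-user VM state directory, e.g. `<base>/bcvk-501/vms`.
pub fn user_vm_dir(base: &Path, uid: u32) -> PathBuf {
    base.join(format!("bcvk-{uid}")).join("vms")
}

/// Create `dir` (and missing parents) with mode 0700, then re-apply 0700 to
/// the leaf in case it already existed with looser permissions.
pub fn create_private_dir(dir: &Path) -> io::Result<()> {
    DirBuilder::new().recursive(true).mode(0o700).create(dir)?;
    fs::set_permissions(dir, Permissions::from_mode(0o700))
}
```

If another user has already created the directory, `set_permissions` fails with a permission error, which is the desired outcome: the tool refuses to use a directory it does not own.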
| "#!/bin/bash\n\ | ||
| mkdir -p /sysroot/var/roothome/.ssh\n\ | ||
| chmod 700 /sysroot/var/roothome/.ssh\n\ | ||
| echo '{}' > /sysroot/var/roothome/.ssh/authorized_keys\n\ | ||
| chmod 600 /sysroot/var/roothome/.ssh/authorized_keys\n\ | ||
| chown -R 0:0 /sysroot/var/roothome/.ssh\n", | ||
| pubkey | ||
| ); |
The SSH public key is inserted into a shell script using single quotes. While SSH public keys usually do not contain single quotes, a corrupted or maliciously crafted key could lead to command injection within the initramfs environment. A safer approach would be to write the key directly to a file in the CPIO archive and have the script reference that file, or use a heredoc with a quoted delimiter (e.g., cat <<'EOF').
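Besides the two fixes suggested above, a cheap extra guard is to reject or escape the key before it reaches the script template. The sketch below is an assumption-laden illustration, not the PR's code: `validate_pubkey` and `shell_single_quote` are hypothetical helpers, and the allowlist is deliberately conservative (it would reject some legitimate key comments).

```rust
/// Hypothetical validator: accept only a single-line key made of characters
/// that cannot terminate a single-quoted shell string.
pub fn validate_pubkey(key: &str) -> Result<&str, String> {
    let key = key.trim();
    if key.is_empty() {
        return Err("empty public key".into());
    }
    let allowed = |c: char| c.is_ascii_alphanumeric() || " +/=@._-:".contains(c);
    match key.chars().find(|c| !allowed(*c)) {
        Some(bad) => Err(format!("unexpected character {bad:?} in public key")),
        None => Ok(key),
    }
}

/// Alternative: POSIX single-quote escaping, where each `'` becomes `'\''`.
pub fn shell_single_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', r"'\''"))
}
```

Writing the key as its own file in the CPIO archive, as the comment proposes, avoids shell quoting entirely and is likely the more robust choice; the helpers above only harden the existing template.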
| let mut response = vec![0u8; 1024]; | ||
| let _ = std::io::Read::read(&mut stream, &mut response); | ||
| let response_str = String::from_utf8_lossy(&response); |
Ignoring the result of the read operation is brittle. It does not account for partial reads or I/O errors. This could lead to incorrect status checks if the response is not fully read in the first chunk or if the connection is closed prematurely.
Suggested change:
| let mut response = vec![0u8; 1024]; | |
| - let _ = std::io::Read::read(&mut stream, &mut response); | |
| - let response_str = String::from_utf8_lossy(&response); | |
| + let n = std::io::Read::read(&mut stream, &mut response).context("reading gvproxy response")?; | |
| + let response_str = String::from_utf8_lossy(&response[..n]); | |
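The suggestion above handles I/O errors and truncates to the bytes actually read, but a single `read` can still return only part of the response. A minimal sketch of reading across partial reads with a size cap follows; `read_response` is a hypothetical helper, and it assumes the peer closes the connection after replying (or that a read timeout is set on the stream), since `read_to_end` otherwise blocks.

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes, continuing across partial reads until EOF,
/// and return the response as lossy UTF-8.
pub fn read_response<R: Read>(stream: R, limit: u64) -> io::Result<String> {
    let mut buf = Vec::new();
    // `take` caps how much a misbehaving peer can make us buffer.
    stream.take(limit).read_to_end(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```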
| pub fn find_available_ssh_port() -> u16 { | ||
| use rand::Rng; | ||
| let mut rng = rand::rng(); | ||
| const PORT_RANGE_START: u16 = 2222; | ||
| const PORT_RANGE_END: u16 = 3000; | ||
| for _ in 0..100 { | ||
| let port = rng.random_range(PORT_RANGE_START..PORT_RANGE_END); | ||
| if std::net::TcpListener::bind(("127.0.0.1", port)).is_ok() { | ||
| return port; | ||
| } | ||
| } | ||
| for port in PORT_RANGE_START..PORT_RANGE_END { | ||
| if std::net::TcpListener::bind(("127.0.0.1", port)).is_ok() { | ||
| return port; | ||
| } | ||
| } | ||
| PORT_RANGE_START | ||
| } |
The find_available_ssh_port function has a Time-of-Check to Time-of-Use (TOCTOU) race condition. The port is checked for availability by binding and then immediately closing it, but it could be taken by another process before gvproxy actually attempts to use it. While the caller has retry logic, this approach can lead to intermittent failures in busy environments.
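Because gvproxy binds the port itself, the probe cannot simply hand over an open listener, so the race can only be narrowed and then absorbed by retrying. The sketch below shows one way to structure that, under stated assumptions: `probe_free_port` and `launch_with_port_retry` are hypothetical names, and `launch` stands in for whatever starts gvproxy. Unlike the quoted function, the probe returns `None` instead of falling back to a port that was just found to be in use.

```rust
use std::net::TcpListener;
use std::ops::Range;

/// Probe a range sequentially; `None` means nothing was free, rather than
/// silently returning a port that is known to be taken.
pub fn probe_free_port(range: Range<u16>) -> Option<u16> {
    range.into_iter().find(|&p| TcpListener::bind(("127.0.0.1", p)).is_ok())
}

/// Pick a fresh port for every attempt so a port lost to the race between
/// probing and launching is not retried blindly.
pub fn launch_with_port_retry<T, E>(
    attempts: usize,
    mut pick_port: impl FnMut() -> Option<u16>,
    mut launch: impl FnMut(u16) -> Result<T, E>,
) -> Option<(u16, T)> {
    for _ in 0..attempts {
        let port = pick_port()?;
        if let Ok(value) = launch(port) {
            return Some((port, value));
        }
    }
    None
}
```

A caller would pass something like `|| probe_free_port(2222..3000)` as `pick_port` and the gvproxy startup routine as `launch`.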
cgwalters left a comment
Thanks so much for starting this!
I only skimmed so far
| @@ -0,0 +1,136 @@ | |||
| //! Cross-platform SSH option types shared between Linux and macOS backends. | |||
| //! | |||
| //! Extracted from ssh.rs to avoid pulling in Linux-only dependencies on macOS. | |||
Can you do a "prep" PR which refactors out common code?
| if let Err(e) = Command::new("kill") | ||
| .args([&vm.gvproxy_pid.to_string()]) |
Surely we can just use rustix::process::kill_process here. Please look for other things like this.
| print!("Remove all ephemeral VMs? [y/N]: "); | ||
| std::io::stdout().flush()?; | ||
| let mut input = String::new(); | ||
| std::io::stdin().read_line(&mut input)?; | ||
| let input = input.trim().to_lowercase(); | ||
| if input != "y" && input != "yes" { |
Hmm this may not be a new thing but let's try to use say dialoguer or so
Agreed. Linux has the same pattern too. How would you like to handle this — separate follow-up?
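Until a crate such as dialoguer is adopted, the duplicated prompt could at least live in one shared, testable helper. The following is a std-only sketch of that idea, not the dialoguer API: `confirm` is a hypothetical function, generic over its input and output so the Linux and macOS paths (and unit tests) can share it.

```rust
use std::io::{self, BufRead, Write};

/// Print `prompt` with a `[y/N]` suffix and return true only for "y" or "yes"
/// (case-insensitive). Anything else, including an empty line, means "no".
pub fn confirm<R: BufRead, W: Write>(prompt: &str, mut input: R, mut output: W) -> io::Result<bool> {
    write!(output, "{prompt} [y/N]: ")?;
    output.flush()?;
    let mut line = String::new();
    input.read_line(&mut line)?;
    Ok(matches!(line.trim().to_lowercase().as_str(), "y" | "yes"))
}
```

Production code would call it as `confirm("Remove all ephemeral VMs?", std::io::stdin().lock(), std::io::stdout())`.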
| /// Options for launching an ephemeral VM via vfkit. | ||
| #[derive(clap::Parser, Debug)] | ||
| pub struct RunEphemeralOpts { |
Also, ideally share a clap #[flatten] struct with Linux.
Makes sense. There's a good overlap (memory, vcpus, debug, execute, ssh_keygen) but macOS also needs name, kernel_args, gui, and detach which don't exist on Linux. What would be the best way to split it?
| //! | ||
| //! Boot flow: | ||
| //! 1. Extract kernel + initramfs from container image | ||
| //! 2. Create SquashFS rootfs (lz4, cached by digest) |
The thing is that's O(data) to create whereas to me a key bit of ephemeral today is that it's "cheap" to launch.
Also, we've invested in EROFS for composefs as opposed to squashfs.
I'm not fundamentally opposed to making lookaside disk images (as apple/container does too) in the short term BUT I think in the medium term we really need something efficient.
This also relates to #213 - basically one model here might be where we make a composefs upper and the object store gets backed by remote access to the podman-machine store?
Thanks for the review. Based on your feedback, this PR has been reworked from the initial SquashFS implementation.
- Adopted a fully diskless architecture — no disk images, shell scripts, or mkfs commands are generated at any point.
- Chose NBD as the transport protocol, using Apple Virtualization.framework's VZNetworkBlockDeviceStorageDeviceAttachment for EFI boot.
- Serve NBD via podman run -p inside podman machine, reusing gvproxy's TCP port forwarding for NBD traffic between host and VM.
- Built a custom nbdkit EROFS plugin from scratch (crates/nbdkit-erofs-plugin) that dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table from the container overlay directory using the regions pattern.
This approach could also be applied to a Windows/Hyper-V backend.
One thing still TBD is the plugin distribution method. For local testing, I've been manually placing the .so inside podman machine. A few options I'm considering:
- Bundle in the bcvk RPM. Could be included at podman machine image build time, then bind-mounted into the nbdkit container with -v.
- Ship a dedicated container image with the plugin pre-installed. Adds image maintenance overhead.
- Upstream the plugin to nbdkit. Probably too bcvk-specific to be a good fit.
There may be other approaches too. Any thoughts on the best way to handle this?
OK, NBD seems like it will work for now.
> reusing gvproxy's TCP port forwarding for NBD traffic between host and VM.

Ideally though we don't involve IP networking for this. I think we could have the VM connect to a unix domain socket instead?

> Built a custom nbdkit EROFS plugin from scratch (crates/nbdkit-erofs-plugin) that dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table

Hmm, but for ephemeral we don't need a GPT partition table (or ESP), we just need any mountable filesystem. We should be doing a direct kernel boot.

> from the container overlay directory using the regions pattern.

Ah...interesting. Hmm, I have questions about that but I guess I can toss my own LLM at the code to ask
> One thing still TBD is the plugin distribution method.

Worth noting the larger "we" here control all 3 actors involved here (podman machine, bcvk, and the initramfs inside the guest).

> Bundle in the bcvk RPM. Could be included at podman machine image build time, then bind-mounted into the nbdkit container with -v.

There's no RPM on macOS, and we don't require bcvk to be installed on podman machine today (it would drag the virt stack into the podman machine host OS, among other things).
I think the thing that would keep the complexity here bundled inside bcvk (ignoring cross-architecture issues which would make this all way more complex) is to bundle the shared library inside our binary, and then use the Podman-machine connection to dynamically inject it into the target VM.
Basically then we can change what happens here at any point just by changing bcvk, no dependency on updates to podman machine.
Sorry for the delay on this update. I prototyped vsock and ran benchmarks — here are the key results. The user-facing performance difference is negligible for typical workloads, so the choice is more of an architectural decision.
Benchmark results (M1 Max, vfkit VM, dd 1GB):
- TCP via gvproxy: 938–1031 MB/s
- vsock via libkrun: 575–605 MB/s
This may seem counterintuitive, but TCP runs at the host kernel level via gvproxy, while vsock goes through libkrun's userspace muxer with more hops (4 hops vs 2 for TCP), which explains the gap.
TCP was the original choice for this PR since it works with stock Podman. After the PoC, TCP remains the recommended approach — vsock would also require upstream features that don't exist yet:
- containers/podman: vsock port forwarding support in the machine provider
- containers/krunkit: connect mode for vsock socket creation
If there's interest in revisiting vsock in the future, the PoC code is preserved in the wip/macos-vfkit-vsock branch.
> Hmm, but for ephemeral we don't need a GPT partition table (or ESP), we just need any mountable filesystem. We should be doing a direct kernel boot.

Direct kernel boot on macOS has a consideration: since vfkit runs on the host, kernel and initramfs need to be extracted from the container image (inside podman machine) and written to a shared path (/private/tmp via virtiofs) so vfkit can access them.
This is a tradeoff: the current design generates everything dynamically via nbdkit with no file extraction — the EROFS plugin computes responses on demand from the overlay directory. Direct kernel boot would require writing vmlinuz + initramfs to the host filesystem before launch (cacheable by image digest, so only first-run cost).
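The "cacheable by image digest" part of the direct-kernel-boot option can be sketched as a pure path computation plus an existence check. This is only an illustration of the idea: `BootArtifacts`, `cache_paths`, `is_cached`, and the file names `vmlinuz` and `initramfs.img` are assumptions, not names taken from the PR.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical pair of extracted boot files for one image digest.
pub struct BootArtifacts {
    pub kernel: PathBuf,
    pub initramfs: PathBuf,
}

/// Map a `sha256:<64 hex chars>` digest to cache paths under `cache_root`.
/// Anything else is rejected so a malformed digest cannot escape the cache
/// directory via path separators or `..`.
pub fn cache_paths(cache_root: &Path, image_digest: &str) -> Option<BootArtifacts> {
    let hex = image_digest.strip_prefix("sha256:")?;
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let dir = cache_root.join(hex);
    Some(BootArtifacts {
        kernel: dir.join("vmlinuz"),
        initramfs: dir.join("initramfs.img"),
    })
}

/// Extraction is only needed when either file is missing (the first-run cost).
pub fn is_cached(artifacts: &BootArtifacts) -> bool {
    artifacts.kernel.is_file() && artifacts.initramfs.is_file()
}
```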
> I think the thing that would keep the complexity here bundled inside bcvk (ignoring cross-architecture issues which would make this all way more complex) is to bundle the shared library inside our binary, and then use the Podman-machine connection to dynamically inject it into the target VM.

Implemented. The .so is embedded in the bcvk binary via include_bytes! and on first ephemeral run, bcvk automatically builds an nbdkit container image inside podman machine. No rpm-ostree install or manual .so deployment needed.
The remaining challenge is the .so build itself. It's an nbdkit plugin shared library that can only be built on Linux. For CI, we'll need a Linux job to build the .so and make it available to the macOS/Windows build jobs. For local development, developers need either a Linux environment (e.g. podman run) or a cross-compile toolchain.
> The remaining challenge is the .so build itself. It's an nbdkit plugin shared library that can only be built on Linux.

cargo-zigbuild seems to be increasingly popular, it'd make sense to me to do that by default.
That said, we can obviously support/encourage a flow on mac/windows that uses Linux containers to build.
I've set up cargo-zigbuild for cross-building the plugin. Added make plugin-so-aarch64 and make plugin-so-x86_64 targets to switch between architectures. Updated the CI workflow accordingly. Tested locally and it passes.
force-pushed from 967611a to 4a1c1dd
I think we would need to backfill CI here. One thing that may help significantly is for us to have an opt-in mode that simulates the proposed macOS architecture, but on Linux - that should be easy to do, we can have a flow that sets up podman machine and runs that way. In fact, we could just make that a first class operation by default - detect if we're using podman machine on Linux and have things Just Work.
macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Ephemeral VMs use direct kernel boot with SquashFS, persistent VMs use EFI boot. The vfkit/ module mirrors the libvirt/ directory structure, and CLI options match Linux where applicable.

Build and run on macOS:

    cargo build --release
    codesign -fs - target/release/bcvk

Tested on macOS (Apple Silicon) with rootful and rootless podman machine.

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>
macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Ephemeral VMs use a custom nbdkit EROFS plugin that dynamically generates rootfs, ESP, and GPT from the container overlay via NBD. Persistent VMs use EFI boot. The vfkit/ module mirrors the libvirt/ directory structure, and CLI options match Linux where applicable. Plugin distribution method is TBD.

Build and run on macOS:

    cargo build --release
    codesign -fs - target/release/bcvk

Tested on macOS (Apple Silicon) with rootful and rootless podman machine.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>
force-pushed from 4a1c1dd to b1c2573
Replace the last 2 instances of Command::new("kill") with
rustix::process::kill_process in the --replace VM cleanup path.
All macOS code now uses rustix for process signals, as requested
by maintainer in PR bootc-dev#259 review.
Assisted-by: Claude Code (Claude Opus 4.6)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- to-disk with APFS clonefile-based base disk caching
- vm_helpers.rs shared with Windows (12 functions)
- nbdkit .so plugin auto-build via include_bytes! embedding
- CLI options unified with Linux/Windows (--ssh, --ssh-wait, --force, --stop, --install-log, --label, --format, --itype)

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>
Agreed, CI is needed. Simulating the macOS architecture on Linux via podman machine is an interesting approach. The podman-machine interaction (nbdkit, to-disk, SSH) should work on Linux as-is, though it would need a QEMU hypervisor backend to replace the vfkit layer.
I don't see any issues with that, do you?
No, I don't see any issues either. Let's go with that approach.
| ExecStart=/bin/sh -c 'mkdir -p /run/systemd/system/sysinit.target.wants && cp /usr/lib/systemd/system/bcvk-journal-stream.service /run/systemd/system/ && ln -s ../bcvk-journal-stream.service /run/systemd/system/sysinit.target.wants/'\n", | ||
| ); | ||
| write_file( |
This stuff needs to be deduplicated
Done. Three shared units now reference kit/src/units/ via include_bytes!. journal-stream remains inline since the output device differs between Linux and macOS/Windows.
| Ok(result) | ||
| } | ||
| fn walk_recursive( |
I prefer using cap-std for stuff like this
Done. dir_walk.rs now uses cap_std::fs::Dir for traversal. Unix metadata (mode/uid/gid/mtime/nlink) is obtained via rustix::fs::statat since cap-std's Metadata doesn't expose those. Symlink targets use rustix::fs::readlinkat because container overlays contain absolute symlinks that cap-std rejects as outside the boundary.
| use crate::regions::{Region, RegionType}; | ||
| use std::sync::Arc; | ||
| const EROFS_MAGIC: u32 = 0xE0F5E1E2; |
I think I mentioned this before but we also maintain a huge amount of EROFS Rust code in https://github.com/composefs/composefs-rs/tree/main/crates/composefs/src/erofs which is today specialized for composefs, but we could probably at least try to lift/share some of the code.
Maybe we could factor it out into a crate.
That said there is also https://lib.rs/crates/erofs-rs which is actively developed it seems, but I have not looked closely at it.
I looked into both composefs-rs and erofs-rs. The on-disk format constants and struct definitions (superblock, inode, dirent) are clearly duplicated across projects. If composefs-rs factored out its format definitions into a standalone crate, bcvk could use that instead of its own. The build logic itself can't be shared — bcvk generates EROFS on demand via NBD pread rather than writing to a file.
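To make "generates EROFS on demand via NBD pread" concrete, here is a minimal sketch of the general regions idea as it is commonly used for virtual disks: the disk is described as a list of byte ranges, each with a function that produces its bytes, and a read is answered by visiting the ranges it overlaps. This is not the plugin's code; `Region`, `VirtualDisk`, and the `fill` callback signature are hypothetical and unrelated to the PR's `crate::regions` types.

```rust
/// A contiguous byte range of the virtual disk plus a generator for its bytes.
/// `fill(offset_in_region, out)` must write exactly `out.len()` bytes.
pub struct Region {
    pub start: u64,
    pub len: u64,
    pub fill: Box<dyn Fn(u64, &mut [u8]) + Send + Sync>,
}

/// Non-overlapping regions, kept sorted by start offset.
pub struct VirtualDisk {
    regions: Vec<Region>,
}

impl VirtualDisk {
    pub fn new(mut regions: Vec<Region>) -> Self {
        regions.sort_by_key(|r| r.start);
        Self { regions }
    }

    /// Answer a read of `buf.len()` bytes at `offset`. Nothing is stored:
    /// every byte is computed on demand, and gaps between regions read as zero.
    pub fn pread(&self, buf: &mut [u8], offset: u64) {
        buf.fill(0);
        let end = offset + buf.len() as u64;
        for r in &self.regions {
            // Intersect the request [offset, end) with this region.
            let lo = offset.max(r.start);
            let hi = end.min(r.start + r.len);
            if lo < hi {
                let dst = &mut buf[(lo - offset) as usize..(hi - offset) as usize];
                (r.fill)(lo - r.start, dst);
            }
        }
    }
}
```

Because the generators are plain functions over offsets, format constants and on-disk structs could come from a shared crate while the on-demand dispatch stays specific to the NBD server.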
- Cross-build .so via cargo-zigbuild (make plugin-so-aarch64/x86_64)
- Deduplicate initramfs units with include_bytes! from shared units/
- Use cap-std for directory walking in nbdkit-erofs-plugin
- Add per-architecture plugin-so Makefile targets

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>
force-pushed from ee93f4f to 7477a1d
| if podman image exists {image}; then exit 0; fi; \ | ||
| mkdir -p /var/tmp/bcvk; \ | ||
| printf '%s' '{b64}' | base64 -d > /var/tmp/bcvk/plugin.so; \ | ||
| printf 'FROM quay.io/fedora/fedora:latest\\nRUN dnf install -y nbdkit nbdkit-basic-plugins && dnf clean all\\nCOPY plugin.so /plugin.so\\n' | \ |
Hmmm. So this won't ever get updated on an existing machine unless someone prunes the image.
That's probably ok for a PoC, but I suspect it'll bite us down the line.
Obviously, we could ship a pre-built version of the container image from upstream...but then we don't need the .so baked into the binary.
There is a bigger path - we could use https://github.com/vi/rust-nbd (hmm looks like it could use some revitalization).
I mean, we're generating so much code here that I think the NBD server implementation isn't like that much more - and when we do that we can have a single executable binary (not a container image) that we directly run on the target host as a systemd unit?
Edit: Also when we go that route, we don't need to deal with the C interface stuff.
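One low-cost way to address the staleness concern, short of replacing nbdkit, would be to derive the image tag from the embedded plugin bytes, so that the quoted `podman image exists {image}` check misses whenever a new bcvk ships a different .so. The sketch below assumes that approach; `plugin_image_tag` and the repository name are hypothetical, and FNV-1a is used only to keep the example std-only (a real implementation would more likely use SHA-256).

```rust
/// FNV-1a, 64-bit: a small non-cryptographic hash, good enough to
/// distinguish builds of the embedded plugin in this sketch.
pub fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: tag the nbdkit image with the hash of the plugin it
/// contains, e.g. `localhost/bcvk-nbdkit:<16 hex digits>`.
pub fn plugin_image_tag(repo: &str, plugin_so: &[u8]) -> String {
    format!("{repo}:{:016x}", fnv1a64(plugin_so))
}
```

Old tags would accumulate on the machine and need occasional pruning, which is one argument for the single-binary route described above.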
macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Unlike the Linux path which uses podman containers for isolation, macOS launches vfkit directly with per-VM resource separation.
Ephemeral VMs use a fully diskless architecture: a custom nbdkit EROFS plugin (crates/nbdkit-erofs-plugin) dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table from the container overlay directory, served via NBD. No disk images, shell scripts, or mkfs commands are needed. SSH keys are injected via initramfs CPIO append. Plugin distribution method is TBD.
Persistent VMs use EFI boot with disk images (EFI firmware is provided by vfkit via Apple Virtualization.framework, no external firmware files needed). The vfkit/ module mirrors the libvirt/ directory structure and provides the same subcommands: run, list, ssh, stop, start, rm, rm-all, inspect. Disk images with podman/buildah xattrs (security.selinux) are automatically cleaned before launch since Apple Virtualization.framework rejects them.
The only runtime dependency is Podman — the macOS PKG installer bundles vfkit and gvproxy, so no additional installation is needed. Homebrew is also supported.
Build and run:
No entitlements needed — bcvk launches vfkit as a subprocess.
Tested manually on macOS (Apple Silicon) with rootful and rootless podman machine.
Fixes: #21
Assisted-by: Claude Code (Claude Opus 4.6)